# DESIGN AND DEVELOPMENT OF UNSIGNED MULTIPLIERS WITH LOW ENERGY CONFIGURABLE ERROR RECOVERY

<sup>1</sup>P MOUNIKA, <sup>2</sup>K HANUJA, <sup>3</sup>Dr SARDAR KHAME SINGH <sup>1</sup>Assistant professor, <sup>2</sup>Associate professor, <sup>3</sup>Professor <sup>1, 2, 3</sup>ECE Department, St. Martin's Engineering College, Secunderabad

## ABSTRACT

In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture-perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.

This multiplier leverages a newly designed approximate adder that limits its carry propagation to the nearest neighbours for fast partial product accumulation. Different levels of accuracy can be achieved by using either OR gates or the proposed approximate adder in a configurable error recovery circuit.

### 1. INTRODUCTION

Energy minimization is one of the main requirements in almost any electronic system, especially portable ones such as smartphones, tablets, and other gadgets. It is highly desirable to achieve this minimization with a minimal performance (speed) penalty. Digital signal processing (DSP) blocks such as adders and multipliers are key components of these portable devices for realizing various multimedia applications. The computational core of these blocks is the arithmetic logic unit, where multiplications have the greatest share among all arithmetic operations performed in these DSP systems. Therefore, improving the speed and power/energy-efficiency characteristics of multipliers plays a key role in improving the efficiency of processors.

Many of the DSP cores implement image and video processing algorithms where the final outputs are either images or videos prepared for human consumption. This fact enables us to use approximations for improving the speed/energy efficiency. This originates from the limited perceptual abilities of human beings in observing an image or a video. In addition to image and video processing applications, there are other areas where the exactness of the arithmetic operations is not critical to the functionality of the system. Being able to use approximate computing gives the designer the ability to make trade-offs between accuracy and speed, as well as power/energy consumption.

Applying the approximation to the arithmetic units can be performed at different design abstraction levels, including the circuit, logic, and architecture levels, as well as the algorithm and software layers.

Approximate computing has emerged as a potential solution for the design of energy-efficient digital systems [1]. Applications such as multimedia, recognition and data mining are inherently error-tolerant and do not require perfect accuracy in computation. For digital signal processing (DSP) applications, the result is often left to interpretation by human perception. Therefore, strict exactness may not be required, and an imprecise result may suffice due to the limitation of human perception. For these applications, approximate circuits play an important role as a promising alternative for reducing area, power and delay, thereby achieving better performance in energy efficiency. As one of the key components in arithmetic circuits, adders have been extensively studied for approximate implementation [2]-[8]. As the typical carry propagation chain is usually shorter than the width of an adder, speculative adders use a reduced number of less significant input bits to calculate the sum bits [2]. An error detection and recovery scheme has been proposed in [3] to extend the scheme of [2] for a reliable adder with variable latency. A reliable variable-latency adder based on carry select addition has been presented in [8]. As a number of approximate adders have been proposed, new methodologies to model, analyze and evaluate them have been discussed.
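
The speculative addition idea mentioned above can be illustrated with a minimal Python sketch. It is not the circuit of [2]; the function name `speculative_add` and the `window` parameter are assumptions introduced only for illustration. Each sum bit is computed with a carry predicted from the `window` bit positions directly below it, so the result is exact whenever no carry chain is longer than the window.

```python
def speculative_add(a: int, b: int, width: int, window: int) -> int:
    """Add two unsigned `width`-bit integers, predicting the carry into each
    bit from only the `window` bit positions directly below it (assumption:
    the speculated carry chain starts with a zero carry-in)."""
    result = 0
    for i in range(width + 1):  # position `width` holds the carry-out bit
        lo = max(0, i - window)
        mask = (1 << (i - lo)) - 1
        # Carry into bit i, computed from bits lo..i-1 only.
        carry = (((a >> lo) & mask) + ((b >> lo) & mask)) >> (i - lo)
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result


if __name__ == "__main__":
    # Short carry chains are handled exactly, long ones are not.
    print(speculative_add(5, 6, width=4, window=2), 5 + 6)
    print(speculative_add(7, 1, width=4, window=2), 7 + 1)
```

With a window as wide as the adder, the sketch degenerates to exact addition; shrinking the window trades accuracy on long carry chains for a shorter critical path.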

The proposed multiplier can be configured into two designs by using OR gates and the proposed approximate adders for error reduction, referred to as approximate multiplier 1 (AM1) and approximate multiplier 2 (AM2), respectively. Different levels of error recovery can also be achieved by using a different number of MSBs for error recovery in both AM1 and AM2. As per the analysis, the proposed multipliers have significantly shorter critical paths and lower power dissipation than the traditional Wallace multiplier. Functional and circuit simulations are performed to evaluate the performance of the multipliers. Image sharpening and smoothing are considered as approximate multiplication-based DSP applications. Experimental results indicate that the proposed approximate multipliers perform well in these error-tolerant applications. The proposed designs can be used as effective library cells for the synthesis of approximate circuits.

### 2. LITERATURE SURVEY

Ohban et al. (2002) exploit a transition activity optimization technique, namely hardware bypassing. Since adding a zero partial product generates a large number of signal transitions in the carry-adder array without affecting the result, the authors propose to dynamically bypass such additions by disabling the adders. This row-bypassing technique saves up to 27% of transitions in comparison to the traditional multiplier design. Wen et al. (2005) propose a low-power parallel multiplier design with a column-bypassing technique, in which some columns in the multiplier array can be turned off whenever their outputs are known, i.e., the operations in a column can be disabled if the corresponding bit in the multiplicand is 0. The circuit overhead of the column-bypassing scheme is smaller than that of the row-bypassing scheme: only one multiplexer per adder cell is needed (compared with two in the row-bypassing scheme). In general, the bypassing technique is a generic architecture design and does not require the elaborate transistor size tweaking needed in other delay-sensitive schemes. There is no need for extra clock signals or delay cells either.

Chong et al. (2005) describe a low-voltage micropower 16×16-bit 2's complement array multiplier that features low switching operation. The authors attain the micropower attribute by replacing most of the adders in the Adder Block with Latch Adders (LAs). In the LAs, the novelty is the realization of latches as an integral part of the adder, resulting in small power and hardware overheads. These integrated latches are effectively placed at the input of the adders and serve to synchronize the inputs to the adder. With the latch function and simple delay circuits, the inputs are synchronized to the adders in the adder block in a predetermined chronological sequence, thereby substantially reducing the spurious switching.

Wu et al. (2005) have designed a $64 \times 64$-bit high-performance multiplier based on multiplexer cells, which is implemented with pass-transistor logic. A multiplexer-select Booth encoder is implemented to increase the speed and reduce the hardware cost. Moreover, a partitioned method is introduced in the design to save the propagation time of the final adder. Realistic simulation using timing parameters extracted from the layout shows that the propagation time of the critical path of the Booth encoder is reduced to 50% compared to the conventional design.

Danysh & Tan (2005) presented the design and implementation of a vector multiply-accumulate (MAC) unit that can perform one $64 \times 64$, two $32 \times 32$, four $16 \times 16$, or eight $8 \times 8$ bit signed/unsigned multiply-accumulates using essentially the same hardware as a 64-bit MAC and without significantly more delay. The concept of "shared segmentation" is introduced, in which the existing scalar hardware structure is segmented and then shared between vector modes. In the case of the MAC, the scalar architecture is "vectorized" by inserting mode-dependent masking into the partial product generation and by injecting mode-dependent kills in the carry chain of the reduction tree and the final carry-propagate adder.

Lee et al. (2005) presented new bit-parallel dual basis multipliers using the modified Booth's algorithm (MBA). The proposed multiplier inherits the advantage of the MBA and reduces both space and time complexities. A multiplexer-based structure is proposed for the realization of the proposed algorithm. The authors have shown that their multiplier saves about 9% in space complexity compared to former multipliers if the generating polynomial is a trinomial or an all-one polynomial. Furthermore, the authors claim that the proposed multiplier is faster.

### 3. PROPOSED APPROXIMATE MULTIPLIERS

The design of the multiplier focuses mainly on the reduction of the partial products, which may also reduce the area, delay and power consumption of the multiplier. In the approximate multiplier design, the adders are replaced with compressors to obtain higher performance. Using approximate compressors instead of exact compressors reduces the circuit complexity of the multiplier. The proposed approximate multipliers are implemented with half adders and full adders whose outputs are passed on to the next reduction stage. The MSBs are used in the approximation part and the LSBs are used in the truncation part, and these are carried into the next reduction stages. The remaining bits are applied to the partial product reduction tree and are used in the subsequent addition stage. The result is a simplified multiplier designed with a smaller number of adders, which can produce its outputs at very high speed.



Fig. 1: Partial product reduction using truncation and the proposed approximate compressors for a multiplier.

With the approximation part and the truncation part, the result is obtained with lower power consumption and reduced hardware. Using exact compressors for the remaining bits limits the loss of accuracy.

The design of the multiplier comprises three stages: the generation of the partial products is the first stage, the reduction of the partial products takes place in the second stage, and the final addition is performed in the third stage. The generation and the addition of the partial products can be carried out efficiently using the approximate compressors, which reduces the power consumption.

In the proposed multiplier, the probability of the partial products $a_{m,n}$ is considered from a statistical point of view. Where there is a larger number of partial products, $a_{m,n}$ and $a_{n,m}$ are combined to form propagate and generate signals. The resulting altered partial products are $p_{m,n}$ and $g_{m,n}$: the partial products $a_{m,n}$ and $a_{n,m}$ are replaced with the propagate and generate signals derived from them, as given below, where $+$ denotes the logical OR and $\cdot$ the logical AND.

$$p_{m,n} = a_{m,n} + a_{n,m}$$

$$g_{m,n} = a_{m,n} \cdot a_{n,m}$$

The generate signal $g_{m,n}$ of the altered partial products has a probability of 1/16 of being one, which is lower than that of the partial product $a_{m,n}$, whose probability of being one is 1/4. Hence the altered partial products lead to lower power consumption. Applying the approximation to these partial products yields higher performance. In the partial product reduction tree, OR gates are used in the accumulation stages to combine the generate and propagate signals, at the cost of a possible error.
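
The probabilities quoted above can be checked by enumeration. The following Python sketch assumes that each partial product is the AND of one multiplicand bit and one multiplier bit, and that all input bits are independent and uniformly distributed; the names `altered_partial_products` and `signal_probabilities` are hypothetical helpers, not part of the original design description.

```python
from fractions import Fraction
from itertools import product


def altered_partial_products(a_mn: int, a_nm: int):
    """Propagate (OR) and generate (AND) signals of two partial products."""
    return a_mn | a_nm, a_mn & a_nm


def signal_probabilities():
    """Probability of a, p and g being one over all operand-bit combinations."""
    counts = {"a": 0, "p": 0, "g": 0}
    cases = list(product((0, 1), repeat=4))
    for x_m, x_n, y_m, y_n in cases:
        a_mn = x_m & y_n  # assumed form of the partial product a_{m,n}
        a_nm = x_n & y_m  # assumed form of the partial product a_{n,m}
        p, g = altered_partial_products(a_mn, a_nm)
        # OR plus AND equals the arithmetic sum, so the column value is kept.
        assert p + g == a_mn + a_nm
        counts["a"] += a_mn
        counts["p"] += p
        counts["g"] += g
    return {name: Fraction(n, len(cases)) for name, n in counts.items()}


if __name__ == "__main__":
    for name, prob in signal_probabilities().items():
        print(name, prob)
```

Because $p_{m,n}$ and $g_{m,n}$ together carry the same arithmetic value as $a_{m,n}$ and $a_{n,m}$, the substitution itself is lossless; errors only enter once the signals are merged with approximate logic.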

Using OR gates for the reduction of the generate and propagate signals introduces a probability of error. As the number of signals grouped together increases, the error probability increases linearly, and the value of the error also rises. To limit the error probability, the maximum number of bits combined using OR gates is restricted, and the generate signals are reduced accordingly. The partial products are accumulated from the generate and propagate signals obtained from the altered partial products. In the accumulation stage, the approximate half adder, full adder and 4-2 compressor are used. With these approximate adders, the carry bits propagate faster. The sum and carry bits are then passed on to the next reduction stage along with the truncated bits. As the partial products are reduced, the probability of error also decreases. For the approximate half adder with inputs x1 and x2, the sum and carry bits are given by the following equations:

$$Sum = x1 + x2$$

$$Carry = x1 \cdot x2$$
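
A short Python sketch of the two equations above makes the error behaviour concrete. It reads `+` as OR and `·` as AND, and compares the value `Sum + 2·Carry` with the exact sum of the inputs; the function names are illustrative only.

```python
from itertools import product


def approx_half_adder(x1: int, x2: int):
    """Approximate half adder: Sum = x1 OR x2, Carry = x1 AND x2."""
    return x1 | x2, x1 & x2


def half_adder_errors():
    """List (inputs, approximate - exact) for every erroneous input pair."""
    errors = []
    for x1, x2 in product((0, 1), repeat=2):
        s, c = approx_half_adder(x1, x2)
        diff = (s + 2 * c) - (x1 + x2)
        if diff:
            errors.append(((x1, x2), diff))
    return errors


if __name__ == "__main__":
    print(half_adder_errors())
```

Only the case where both inputs are one deviates from the exact half adder, and the deviation has magnitude one.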

In the approximate full adder, one of the XOR gates is replaced with an OR gate in the generation of the sum. This change to the full adder operation introduces an error in the final stages, which is the difference between the exact and approximate values.

$$W = x1 + x2$$

$$Sum = W \oplus x3$$

$$Carry = W \cdot x3$$
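
The approximate full adder equations can be exercised in the same way. This Python sketch, with illustrative function names, evaluates all eight input combinations and reports those where `Sum + 2·Carry` differs from the exact sum `x1 + x2 + x3`.

```python
from itertools import product


def approx_full_adder(x1: int, x2: int, x3: int):
    """Approximate full adder: W = x1 OR x2, Sum = W XOR x3, Carry = W AND x3."""
    w = x1 | x2
    return w ^ x3, w & x3


def full_adder_errors():
    """List (inputs, approximate - exact) for every erroneous combination."""
    errors = []
    for bits in product((0, 1), repeat=3):
        s, c = approx_full_adder(*bits)
        diff = (s + 2 * c) - sum(bits)
        if diff:
            errors.append((bits, diff))
    return errors


if __name__ == "__main__":
    print(full_adder_errors())
```

The two erroneous combinations are those with `x1 = x2 = 1`, where the OR gate loses one unit of value.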

The approximate compressor has four inputs, and an exact count of them would require three output bits. All three output bits are needed in only one of the possible input conditions. To eliminate the third output, the result for that one condition is approximated so that the error difference is kept at its minimum value of one. The sum computation is modified accordingly, as given in the following equations.

$$W1 = x1 \cdot x2$$

$$W2 = x3 \cdot x4$$

$$Sum = (x1 \oplus x2) + (x3 \oplus x4) + W1 \cdot W2$$

$$Carry = W1 + W2$$
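
The compressor equations can be checked by enumeration as well. The Python sketch below reads `+` as OR and `·` as AND and treats all sixteen input combinations as equally likely, which is an assumption made only for this illustration; in the multiplier, the inputs are propagate and generate signals with unequal probabilities.

```python
from itertools import product


def approx_compressor_4_2(x1: int, x2: int, x3: int, x4: int):
    """Approximate 4-2 compressor with two outputs (Sum, Carry)."""
    w1 = x1 & x2
    w2 = x3 & x4
    s = (x1 ^ x2) | (x3 ^ x4) | (w1 & w2)
    c = w1 | w2
    return s, c


def compressor_errors():
    """List (inputs, approximate - exact) for every erroneous combination."""
    errors = []
    for bits in product((0, 1), repeat=4):
        s, c = approx_compressor_4_2(*bits)
        diff = (s + 2 * c) - sum(bits)
        if diff:
            errors.append((bits, diff))
    return errors


if __name__ == "__main__":
    for bits, diff in compressor_errors():
        print(bits, diff)
```

When all four inputs are one, the compressor outputs `Sum = 1`, `Carry = 1`, i.e. the value 3 instead of 4, so the third output bit is avoided at an error distance of one.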

#### Approximation in the partial product tree:

A broken array multiplier (BAM) is used for high-speed addition of the partial products. The BAM operates at high speed because some of the carry-save adders of an array multiplier are omitted in both the horizontal and vertical directions. The error tolerant multiplier (ETM) is split into a multiplication section for the MSBs and a non-multiplication section for the LSBs. A NOR gate based control block is required to handle two conditions:

i) If the product of the MSBs is zero, then the multiplication section is activated to multiply the LSBs without any approximation, and

ii) If the product of the MSBs is nonzero, the non-multiplication section is used as an approximate multiplier to process the LSBs, while the multiplication section is activated to multiply the MSBs.

A power and area efficient approximate Wallace tree multiplier (AWTM) has also been designed. An n-bit AWTM is implemented by four n/2-bit sub-multipliers, and the most significant n/2-bit sub-multiplier is further implemented by four n/4-bit sub-multipliers. The AWTM is configured into four different modes by the number of approximate n/4-bit sub-multipliers in the most significant n/2-bit sub-multiplier. The approximate partial products are then accumulated by a Wallace tree.
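
The sub-multiplier decomposition described above can be sketched in Python as follows. Several details are assumptions made for illustration only: `placeholder_approx_mul` is a stand-in approximate sub-multiplier (the text does not specify the real one), the three less significant n/2-bit sub-multipliers are kept exact, and approximate n/4-bit sub-multipliers are assigned to the least significant quadrants first.

```python
def exact_mul(a: int, b: int, width: int) -> int:
    return a * b


def placeholder_approx_mul(a: int, b: int, width: int) -> int:
    """Stand-in approximate multiplier (assumption): clears the low
    `width` bits of the product."""
    return (a * b) >> width << width


def combine_quadrants(a: int, b: int, n: int, subs) -> int:
    """Build an n-bit product from four n/2-bit sub-multipliers.
    `subs` is (high*high, high*low, low*high, low*low)."""
    half = n // 2
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask
    bh, bl = b >> half, b & mask
    hh, hl, lh, ll = subs
    return ((hh(ah, bh, half) << n)
            + ((hl(ah, bl, half) + lh(al, bh, half)) << half)
            + ll(al, bl, half))


def awtm_like_multiply(a: int, b: int, n: int, approx_quarters: int) -> int:
    """n-bit multiply whose most significant n/2-bit sub-multiplier is split
    into four n/4-bit sub-multipliers, `approx_quarters` of them approximate."""
    quarters = ([exact_mul] * (4 - approx_quarters)
                + [placeholder_approx_mul] * approx_quarters)

    def top(ah: int, bh: int, half: int) -> int:
        return combine_quadrants(ah, bh, half, quarters)

    return combine_quadrants(a, b, n, (top, exact_mul, exact_mul, exact_mul))


if __name__ == "__main__":
    for mode in range(5):
        print(mode, awtm_like_multiply(200, 123, 8, mode), 200 * 123)
```

With no approximate quarters the decomposition reproduces the exact product, which is a useful sanity check before a real approximate sub-multiplier is substituted.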

#### 16 × 16 Approximate Multipliers

In both AM1 and AM2, all the error vectors are compressed to one error vector, which is then added back to the approximate output of the partial product tree. Compared to $8 \times 8$ designs, $16 \times 16$ multipliers generate more error vectors, and too much information would be ignored if the same error reduction strategies were used. That is, using only one compressed error vector does not give a good estimate of the overall error. In the modified designs, the error vectors generated by the approximate adders are compressed to two final error vectors. Taking a $16 \times 16$ AM1 as an example, the eight error vectors generated at the first stage of the partial product accumulation tree are compressed to one error vector, EV1, using OR gates. The remaining seven error vectors from the second, third and fourth stages are compressed to another error vector, EV2. Then both EV1 and EV2 are added back to the output of the partial product tree at the fourth stage. Similarly, the proposed approximate adders are used in a $16 \times 16$ AM2 to compress the eight error vectors from the first stage to one error vector and the remaining error vectors to another error vector.
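
The two-vector error recovery of a $16 \times 16$ AM1 can be sketched in Python. The adder used here is an assumption: the text only states that the approximate adder limits carry propagation to the nearest neighbour and produces an error signal, so `approx_add` uses one bitwise formulation with that property, for which the approximate sum plus the error vector equals the exact sum. The function names and the pairwise tree ordering are likewise illustrative.

```python
from functools import reduce


def approx_add(a: int, b: int):
    """Assumed approximate adder: each sum bit sees only the carry generated
    by its lower neighbour. Returns (approximate sum, error vector)."""
    x = a ^ b
    c = (a & b) << 1  # carry generated one position below
    return x | c, x & c


def or_compress(vectors) -> int:
    """Compress several error vectors into one using bitwise OR."""
    return reduce(lambda u, v: u | v, vectors, 0)


def am1_like_multiply(a: int, b: int, n: int = 16):
    """Accumulate n partial products (n a power of two) in a binary tree of
    approximate adders, then recover errors with two OR-compressed vectors."""
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    stage_errors = []
    while len(rows) > 1:
        sums, errs = [], []
        for j in range(0, len(rows), 2):
            s, e = approx_add(rows[j], rows[j + 1])
            sums.append(s)
            errs.append(e)
        stage_errors.append(errs)
        rows = sums
    approx = rows[0]
    ev1 = or_compress(stage_errors[0])  # first-stage error vectors
    ev2 = or_compress(e for errs in stage_errors[1:] for e in errs)
    return approx, ev1, ev2, stage_errors


if __name__ == "__main__":
    a, b = 51234, 40321
    approx, ev1, ev2, _ = am1_like_multiply(a, b)
    print("exact    ", a * b)
    print("approx   ", approx)
    print("recovered", approx + ev1 + ev2)
```

Under this adder model, adding every individual error vector back would restore the exact product; OR compression can only drop overlapping error bits, so the recovered value lies between the raw approximate output and the exact product.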

#### CONCLUSION

This paper proposes a high-performance and low-power approximate partial product accumulation tree for a multiplier using a newly designed approximate adder. The proposed approximate adder ignores the carry propagation by generating both an approximate sum and an error signal. OR gate and approximate adder based error reduction schemes are utilized, yielding two different approximate $8 \times 8$ multiplier designs: AM1 and AM2. Moreover, modifications are made to the error reduction schemes for $16 \times 16$ multiplier designs, such that TAM1 and TAM2 are obtained by truncating 16 LSBs of the partial products. An efficient 16-bit approximate multiplier is designed using altered partial products, which are generated on the basis of probability. Approximate adders are used to reduce the altered partial products in the partial product reduction tree. With the use of the approximate half adder, full adder and 4-2 compressor, the proposed multiplier achieves higher speed in comparison to the previous multipliers.

#### REFERENCES

[1] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.

[2] E. J. King and E. E. Swartzlander, Jr., "Data-dependent truncation scheme for parallel multipliers," in Proc. 31st Asilomar Conf. Signals, Circuits Syst., Nov. 1998, pp. 1178–1182.

[3] K.-J. Cho, K.-C. Lee, J.-G. Chung, and K. K. Parhi, "Design of low-error fixed-width modified Booth multiplier," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522–531, May 2004.

[4] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.

[5] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," IEEE Trans. Comput., vol. 64, no. 4, pp. 984–994, Apr. 2015.

[6] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, "Energy-efficient approximate multiplication for digital signal processing and classification applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 6, pp. 1180–1184, Jun. 2015.

[7] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, "Design-efficient approximate multiplication circuits through partial product perforation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Oct. 2016.

[8] M. Anand, M. Saritha, and M. Janaki, "Performance of efficient CMOS power amplifier for ISM band applications," International Journal of Innovative Technology and Exploring Engineering, vol. 9, no. 2, pp. 4579–4584, Dec. 2019.

[9] C.-H. Lin and C. Lin, "High accuracy approximate multiplier with error correction," in Proc. IEEE 31st Int. Conf. Comput. Design, Sep. 2013, pp. 33–38.

[10] C. Liu, J. Han, and F. Lombardi, "A low-power, high-performance approximate multiplier with configurable partial error recovery," in Proc. Conf. Exhibit. (DATE), 2014, pp. 1–4.